Because image sharing is central to instant messaging, there has been active research on learning an image-text multi-modal dialogue model. However, training a well-generalized multi-modal dialogue model is challenging because existing multi-modal dialogue datasets contain few examples, limited topics, and a restricted variety of images per dialogue. In this paper, we present a multi-modal dialogue dataset creation pipeline that matches large-scale images to dialogues based on CLIP similarity. Using this automatic pipeline, we propose a large-scale multi-modal dialogue dataset, DialogCC, which covers diverse real-world topics and various images per dialogue. With extensive experiments, we demonstrate that training a multi-modal dialogue model with our dataset can improve generalization performance. Additionally, existing models trained with our dataset achieve state-of-the-art performance on image and text retrieval tasks. The source code and the dataset will be released after publication.
Translated by Google Translate
Vision Transformer (ViT) extracts the final representation from either the class token or an average of all patch tokens, following the architecture of the Transformer in Natural Language Processing (NLP) or of Convolutional Neural Networks (CNNs) in computer vision. However, studies on the best way of aggregating the patch tokens are still limited to average pooling, even though widely used pooling strategies such as max and GeM pooling could be considered. Despite their effectiveness, the existing pooling strategies do not consider the architecture of ViT or the channel-wise difference in the activation maps, and so aggregate crucial and trivial channels with the same importance. In this paper, we present Group Generalized Mean (GGeM) pooling as a simple yet powerful pooling strategy for ViT. GGeM divides the channels into groups and computes GeM pooling with a shared pooling parameter per group. As ViT groups the channels via a multi-head attention mechanism, grouping the channels by GGeM leads to lower head-wise dependence while amplifying important channels on the activation maps. Exploiting GGeM yields performance boosts of 0.1%p to 0.7%p over the baselines and achieves state-of-the-art performance for ViT-Base and ViT-Large models on the ImageNet-1K classification task. Moreover, GGeM outperforms the existing pooling strategies on image retrieval and multi-modal representation learning tasks, demonstrating its superiority across a variety of tasks. GGeM is a simple algorithm in that only a few lines of code are necessary for implementation.
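The grouping step described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `ggem_pool`, the contiguous channel grouping, and the clamping of activations to a small positive `eps` (so that fractional powers are defined) are all assumptions made for illustration. In a trained model the per-group exponents `p` would be learned, whereas here they are passed in as fixed values.

```python
import numpy as np

def ggem_pool(tokens, p, eps=1e-6):
    """Group generalized mean pooling over patch tokens (illustrative sketch).

    tokens: array of shape (num_tokens, channels), the patch-token activations.
    p:      array of shape (num_groups,), one pooling exponent shared by all
            channels in the same group.
    Returns an array of shape (channels,).
    """
    num_tokens, channels = tokens.shape
    num_groups = p.shape[0]
    if channels % num_groups != 0:
        raise ValueError("channels must be divisible by the number of groups")

    # Assumption: clamp to a small positive value so x ** p is well defined.
    x = np.clip(tokens, eps, None)

    # Assumption: groups are contiguous channel blocks.
    # Shape becomes (num_tokens, num_groups, channels_per_group).
    x = x.reshape(num_tokens, num_groups, channels // num_groups)

    # GeM per channel, with the exponent broadcast across each group:
    # (mean over tokens of x ** p) ** (1 / p)
    powered = x ** p.reshape(1, num_groups, 1)
    pooled = powered.mean(axis=0) ** (1.0 / p.reshape(num_groups, 1))
    return pooled.reshape(channels)
```

With `p = 1` for a group, the sketch reduces to average pooling on that group's channels, and larger exponents move the result toward max pooling, which is the usual behavior of generalized mean pooling.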
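Global food insecurity is expected to worsen in the coming decades as climate change accelerates and the population grows rapidly. In this vein, it is important to remove inefficiencies at every level of food production. Recent advances in deep learning can help reduce such inefficiencies, yet their application has not become mainstream across the industry, which incurs economic costs at a massive scale. To this end, modern techniques such as CNNs (Convolutional Neural Networks) have been applied to RPQD (Raw Produce Quality Detection) tasks. On the other hand, the Transformer's successful debut in vision, among other modalities, leads us to expect better RPQD performance from Transformer-based models. In this work, we exclusively investigate the recent state-of-the-art Swin (Shifted Windows) Transformer, which computes self-attention in both an intra-window and an inter-window fashion. We compare the Swin Transformer against CNN models on four RPQD image datasets, each containing a different kind of raw produce: fruits and vegetables, fish, pork, and beef. We observe that the Swin Transformer not only achieves better or competitive performance but is also data- and compute-efficient, making it well suited to deployment in real-world settings. To the best of our knowledge, this is the first large-scale empirical study on the RPQD task, which we hope will gain more attention in future work.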
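In hash-based image retrieval systems, a transformed version of an original input usually produces a different code, which degrades retrieval accuracy. To mitigate this issue, data augmentation can be applied during training. However, even if the augmented samples of one piece of content are similar in real space, quantization can scatter them far apart in Hamming space. This results in representation discrepancies that can impede training and degrade performance. In this work, we propose a novel self-distilled hashing scheme that minimizes the discrepancy while exploiting the potential of augmented data. By transferring the hash knowledge of weakly transformed samples to strongly transformed ones, we make the hash codes insensitive to various transformations. We also introduce hash proxy-based similarity learning and a binary cross entropy-based quantization loss to provide high-quality hash codes. Ultimately, we construct a deep hashing framework that produces discriminative hash codes. Extensive experiments on benchmarks verify that our self-distillation improves existing deep hashing methods and that our framework achieves state-of-the-art retrieval results. The code will be released soon.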
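The discrepancy between real space and Hamming space can be seen in a toy example. The sketch below assumes sign-based quantization, which is common in deep hashing but is not stated in the abstract. The vectors `weak` and `strong` are made-up feature values standing in for two augmentations of the same image.

```python
import numpy as np

def binarize(features):
    """Quantize real-valued features to {-1, +1} bits (assumed sign quantizer)."""
    return np.where(features >= 0, 1, -1)

def hamming_distance(code_a, code_b):
    """Number of bit positions in which two hash codes differ."""
    return int(np.sum(code_a != code_b))

# Hypothetical features of a weakly and a strongly augmented view.
# The first two dimensions lie close to the quantization boundary at 0.
weak = np.array([0.05, -0.04, 0.90, -0.80])
strong = np.array([-0.05, 0.04, 0.85, -0.75])

real_distance = float(np.linalg.norm(weak - strong))  # small in real space
bit_flips = hamming_distance(binarize(weak), binarize(strong))  # 2 of 4 bits differ
```

The two feature vectors are close in Euclidean distance, yet half of their bits disagree after quantization, because small perturbations near zero flip the sign. This is the kind of gap that a scheme aligning the codes of differently augmented views would aim to close.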
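Strong demand for the efficient deployment of Deep Learning (DL) applications has prompted the rapid development of a rich DL ecosystem. To keep up with this fast progress, it is crucial for DL frameworks to efficiently integrate a variety of optimized libraries and runtimes as their backends and to generate the fastest possible executable by using them properly. However, current DL frameworks require significant manual effort to integrate diverse backends and often fail to deliver high performance. In this paper, we propose Collage, an automatic framework for integrating DL backends. Collage provides a backend registration interface that allows users to precisely specify the capabilities of individual backends. By leveraging the specifications of the available backends, Collage searches for an optimized backend placement for a given workload and execution environment. Our evaluation shows that Collage integrates multiple backends without manual intervention and outperforms existing frameworks by 1.21x, 1.39x, and 1.40x on two different NVIDIA GPUs and an Intel CPU, respectively.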
3D-aware image synthesis focuses on preserving spatial consistency in addition to generating high-resolution images with fine details. Recently, the Neural Radiance Field (NeRF) has been introduced for synthesizing novel views with low computational cost and superior performance. While several works investigate generative NeRFs and show remarkable achievements, they cannot handle conditional and continuous feature manipulation in the generation procedure. In this work, we introduce a novel model, called Class-Continuous Conditional Generative NeRF ($\text{C}^{3}$G-NeRF), which can synthesize conditionally manipulated, photorealistic, 3D-consistent images by projecting conditional features to the generator and the discriminator. The proposed $\text{C}^{3}$G-NeRF is evaluated on three image datasets: AFHQ, CelebA, and Cars. Our model shows strong 3D consistency with fine details and smooth interpolation in conditional feature manipulation. For instance, $\text{C}^{3}$G-NeRF achieves a Fr\'echet Inception Distance (FID) of 7.64 in 3D-aware face image synthesis at a resolution of $128^{2}$. Additionally, we provide FIDs of generated 3D-aware images for each class of the datasets, since $\text{C}^{3}$G-NeRF makes it possible to synthesize class-conditional images.
In both terrestrial and marine ecology, physical tagging is a frequently used method to study population dynamics and behavior. However, such tagging techniques are increasingly being replaced by individual re-identification using image analysis. This paper introduces a contrastive learning-based model for identifying individuals. The model uses the first parts of the Inception v3 network, supported by a projection head, and we use contrastive learning to find similar or dissimilar image pairs from a collection of uniform photographs. We apply this technique to corkwing wrasse, Symphodus melops, an ecologically and commercially important fish species. Photos are taken during repeated catches of the same individuals from a wild population, where the intervals between individual sightings range from a few days to several years. On our dataset, the model achieves a one-shot accuracy of 0.35, a 5-shot accuracy of 0.56, and a 100-shot accuracy of 0.88.
Feature selection helps reduce data acquisition costs in ML, but the standard approach is to train models with static feature subsets. Here, we consider the dynamic feature selection (DFS) problem where a model sequentially queries features based on the presently available information. DFS is often addressed with reinforcement learning (RL), but we explore a simpler approach of greedily selecting features based on their conditional mutual information. This method is theoretically appealing but requires oracle access to the data distribution, so we develop a learning approach based on amortized optimization. The proposed method is shown to recover the greedy policy when trained to optimality and outperforms numerous existing feature selection methods in our experiments, thus validating it as a simple but powerful approach for this problem.
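The greedy rule mentioned above can be sketched for the idealized oracle case, in which the full joint distribution over discrete features and the label is available as an array. This is only an illustration of greedy selection by conditional mutual information, not the amortized learning method the abstract proposes. The function names, the array layout `joint[x_1, ..., x_d, y]`, and the toy distribution are assumptions.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector (zeros are ignored)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def conditional_mutual_information(joint, observed, candidate):
    """I(Y; X_candidate | observed feature values), in bits.

    joint:    array of shape (k_1, ..., k_d, num_classes) with p(x_1..x_d, y).
    observed: dict mapping feature index -> observed value.
    """
    d = joint.ndim - 1
    # Condition on the observed values by slicing, then renormalize.
    index = tuple(observed.get(i, slice(None)) for i in range(d)) + (slice(None),)
    cond = joint[index]
    cond = cond / cond.sum()
    # Axes left after slicing: the unobserved features (in order), then y.
    remaining = [i for i in range(d) if i not in observed]
    axis = remaining.index(candidate)
    other_axes = tuple(a for a in range(len(remaining)) if a != axis)
    p_xy = cond.sum(axis=other_axes)  # shape (k_candidate, num_classes)
    p_x = p_xy.sum(axis=1)
    h_y = entropy(p_xy.sum(axis=0))
    h_y_given_x = sum(
        p_x[v] * entropy(p_xy[v] / p_x[v]) for v in range(len(p_x)) if p_x[v] > 0
    )
    return h_y - h_y_given_x

def greedy_next_feature(joint, observed):
    """Pick the unobserved feature with the largest conditional mutual information."""
    d = joint.ndim - 1
    scores = {
        i: conditional_mutual_information(joint, observed, i)
        for i in range(d)
        if i not in observed
    }
    return max(scores, key=scores.get), scores

# Toy oracle: two binary features, y copies x_0, and x_1 is independent noise.
toy_joint = np.zeros((2, 2, 2))
for x0 in range(2):
    for x1 in range(2):
        toy_joint[x0, x1, x0] = 0.25

first_choice, first_scores = greedy_next_feature(toy_joint, observed={})
```

In the toy distribution the greedy rule queries `x_0` first, since it carries one full bit about the label while `x_1` carries none. In practice the joint distribution is unknown, which is the gap a learned policy is meant to fill.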
The purpose of this work was to tackle practical issues that arise when using a tendon-driven robotic manipulator with a long, passive, flexible proximal section in medical applications. A separable robot that overcomes difficulties in actuation and sterilization is introduced, in which the body containing the electronics is reusable and the remainder is disposable. A control input that resolves the redundancy in the kinematics and a physical interpretation of this redundancy are provided. The effect of a static change in the proximal section angle on bending angle error was explored under four testing conditions for a sinusoidal input. Bending angle error increased with proximal section angle under all testing conditions. Relative to the baseline case, the average error reduction was 41.48% for re-tension, 4.28% for hysteresis, and 52.35% for re-tension + hysteresis compensation. Two major sources of error in tracking the bending angle were identified: time delay from hysteresis and DC offset from the proximal section angle. Examination of these error sources revealed that the simple hysteresis compensation was most effective for removing the time delay, and re-tension compensation was most effective for removing the DC offset, which was the primary source of increasing error. The re-tension compensation was also tested for dynamic changes in the proximal section and reduced error in the final configuration of the tip by 89.14% relative to the baseline case.
With the rapid development of drone technologies, drones are widely used in many applications, including military domains. This paper proposes a novel situation-aware, deep reinforcement learning (DRL)-based autonomous nonlinear drone mobility control algorithm for cyber-physical loitering munition applications. On the battlefield, designing a DRL-based autonomous control algorithm is not straightforward because real-world data gathering is generally not possible. Therefore, the approach in this paper is to construct a cyber-physical virtual environment with Unity. Based on the virtual cyber-physical battlefield scenarios, a DRL-based automated nonlinear drone mobility control algorithm can be designed, evaluated, and visualized. Moreover, real-world battlefield scenarios contain many obstacles, which hinder linear trajectory control. Thus, our proposed autonomous nonlinear drone mobility control algorithm uses situation-aware components that are implemented with a Raycast function in Unity virtual scenarios. Based on the gathered situation-aware information, the drone can autonomously and nonlinearly adjust its trajectory during flight, which is beneficial for avoiding obstacles in obstacle-deployed battlefields. Our visualization-based performance evaluation shows that the proposed algorithm is superior to the other, linear mobility control algorithms.